Deriving Word Association Networks from Text Corpora
نویسندگان
چکیده
This article presents and evaluates a model to automatically derive word association networks from text corpora. Two aspects were evaluated: To what degree can corpus-based word association networks (CANs) approximate human word association networks with respect to (1) their ability to quantitatively predict word associations and (2) their structural network characteristics. Word association networks are the basis of the human mental lexicon. However, extracting such networks from human subjects is laborious, time consuming and thus necessarily limited in relation to the breadth of human vocabulary. Automatic derivation of word associations from text corpora would address these limitations. In both evaluations corpusbased processing provided vector representations for words. These representations were then employed to derive CANs using two measures: (1) the well known cosine metric, which is a symmetric measure, and (2) a new asymmetric measure computed from orthogonal vector projections. For both evaluations, the full set of 4068 free association networks (FANs) from the University of South Florida word association norms were used as baseline human data. Two corpus based models were benchmarked for comparison: a latent topic model and latent semantic analysis (LSA). We observed that CANs constructed using the asymmetric measure were slightly less effective than the topic model in quantitatively predicting free associates, and slightly better than LSA. The structural networks analysis revealed that CANs do approximate the FANs to an encouraging degree.
منابع مشابه
Constructing Word-Sense Association Networks from Bilingual Dictionary and Comparable Corpora
A novel thesaurus named a word-sense association network is proposed for the first time. It consists of nodes representing word senses, each of which is defined as a set consisting of a word and its translation equivalents, and edges connecting topically associated word senses. This word-sense association network is produced from a bilingual dictionary and comparable corpora by means of a new...
متن کاملAn Improved Method for Deriving Word Meaning from Lexical Co-Occurrence
The lexical semantic system is an important component of human language and cognitive processing. One approach to modeling semantic knowledge makes use of hand-constructed networks or trees of interconnected word senses (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990; Jarmasz & Szpakowicz, 2003). An alternative approach seeks to model word meanings as high-dimensional vectors, which are deri...
متن کاملWord Association Thesaurus as a Resource for extending Semantic Networks
The paper reports the on-going research for applying psycholinguistic resources to building and extending semantic networks. We survey different kinds of information that can be extracted from a Word Association Thesaurus (WAT), a resource representing the results of a large-scaled free association test. In addition, we give a comparison of WAT and other language resources (e.g. text corpora, e...
متن کاملAn Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
The lexical semantic system is an important component of human language and cognitive processing. One approach to modeling semantic knowledge makes use of hand-constructed networks or trees of interconnected word senses (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990; Jarmasz & Szpakowicz, 2003). An alternative approach seeks to model word meanings as high-dimensional vectors, which are deri...
متن کاملAutomatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese-English Corpora
This paper presents a hybrid approach to deriving a translation lexicon from unaligned parallel Chinese-English corpora. Two types of information, namely, proximity and document-external distributions of word pairs, are proposed to enhance the precision of the translation lexicon derived from statistical and dictionary-based methods. The former can identify translations of Chinese compounds, wh...
متن کامل